Abstract 

We test a simplified, local version of the helix model on two synthetic and 
two natural proteins, to study its efficiency in predicting the native secondary 
structure. The results we obtain are very good for the synthetic sequences, 
poorer for the two natural ones. This suggests that non-local terms play a 
fundamental role in determining the secondary structure, even if in some cases 
local terms alone may be sufficient. 



I. INTRODUCTION 

It is experimentally known that a protein, under proper solvent and temperature
conditions, folds from any random-shaped state to its "native" state, whose three-dimensional
structure is unambiguously encoded in the amino acid sequence. This state is the only one 
in which the protein is biologically active, and is strictly related to the chemical function 
of the protein. Unfortunately, experimental determination of this state is usually a rather 
difficult task, and it would be highly desirable to know how to predict the structure just 
from the sequence. 

In spite of many years of efforts in the field, a clear understanding of protein folding has 
not been achieved yet. The best results in structure prediction are obtained by algorithms 
which compare the protein under study with a database of sequences of already known 
structure. This approach, even when successful, does not shed any light on the underlying 
physics. 

Because of the wide range of time scales involved and of the complexity of
the system, ab initio simulations of the folding process are today (and will most probably
be for a long time) out of reach. They could nevertheless help to understand the fast events
involved in the folding process.

Simplified physical models have been proposed to capture the most relevant aspects
of the problem. In these approaches the protein is usually described as a chain endowed
with "charged" beads, representing the residues, which attract each other according to their
nature. These models have been studied both on- and off-lattice, resorting to Monte
Carlo simulations.

Most of the theoretical understanding of the thermodynamics and dynamics of the folding
process comes from these models; yet their relationship with natural proteins is somewhat
qualitative, since there is no well-defined mapping between the configuration spaces of real
and model proteins. Hence, it is difficult to say which results can be extended to real proteins
and which are model dependent, and the debate is still open on the identification of
the relevant features that distinguish a good folder from a poor one.

In a recent paper we proposed a new model, which gives a coarse-grained description
of a protein in terms of helices. This choice stems from the fact that the elements of
native secondary structure can be well approximated by one or a few helices. Even
loops can be partitioned into smaller parts and approximated in this way: of course their
description will not be as good as that of α-helices. However, loops are usually found at the
surface of the native globule and are affected by less severe geometrical constraints, so a less
precise representation should not be a major problem.

The motivation for a coarse-grained description is related to the fact that a certain degree 
of redundancy is observed in structure encoding: several different sequences are known to 
fold to essentially the same structure. This suggests that an appropriate average description 
of the sequence could be enough to predict most of the features of the native state (even
if the details of the three-dimensional structure are probably related to close packing of
side chains, and cannot be easily captured in a simplified description).

In this letter we use a simplified version of the helix model, where nonlocal interactions 
are neglected, to predict the secondary structure of two synthetic and two natural proteins,
which are known to fold into a "four-helix bundle". We aim on the one hand to test the 
reliability of our model, and on the other to understand the role of local periodicities of 
polar and non-polar residues in determining the secondary structure of the protein. 

The paper is organized as follows: in Sec. II we recall the main characteristics of the
model, in Sec. III we test the efficiency of the local version of the model in predicting
the secondary structure of the four proteins; finally, in Sec. IV, we briefly summarize and
comment on our results.



II. THE MODEL 

Considering that secondary structure is a fairly general feature of native states,
and that its elements can be well represented by regular helices, we describe any protein
configuration as a continuous curve made of pieces of helices sequentially linked together.
This description is particularly suited for α and 3_10 helices, but it can also be applied to
β-strands (which, in the ideal case, are helices with two residues per turn) and, to a lesser
extent, to the finite class of tight turns presently known and to coil regions, once they are
divided into smaller parts.

The equation of the curve representing the protein chain is assumed to be:

$$ \mathbf{r}(s) = \sum_{i=1}^{N_h} b_i(s)\, \mathbf{h}_i(s) , \qquad (1) $$

where the parameter $s$ ranges from $0$ to $N$, the total number of residues; $b_i(s) = 1$ if
$s \in\, ]s_{i-1}, s_i[$ and $b_i(s) = 1/2$ if $s = s_{i-1}$ or $s = s_i$, whereas $b_i(s) = 0$ if $s \notin [s_{i-1}, s_i]$.
The $\mathbf{h}_i$ are the helices expressed in their reference frame $(\mathbf{e}_{1,i}, \mathbf{e}_{2,i}, \mathbf{e}_{3,i})$:

$$ \mathbf{h}_i(s) = a_i \left[ \left( \cos(u_i (s - s_{i-1})) - 1 \right) \mathbf{e}_{1,i} + \sin(u_i (s - s_{i-1}))\, \mathbf{e}_{2,i} + u_i h_i (s - s_{i-1})\, \mathbf{e}_{3,i} \right] + \mathbf{h}_{i-1}(s_{i-1}) . \qquad (2) $$

The helices are labelled so that helix $i$ starts at $s_{i-1}$ and ends at $s_i$, with $s_0 = 0$ and $s_{N_h} = N$. $N_h$ is the total
number of helices, residues are labelled from 1 to $N$, and the convention holds that a residue
sitting at the junction between two helices belongs to the first one. We let $n_i = s_i - s_{i-1}$
denote the length of helix $i$. We also define

$$ u_i = \sigma_i \frac{L}{a_i \sqrt{1 + h_i^2}} , \qquad (3) $$

where $L$ is the length of a peptide unit, so that the line element on each helix is $|d\mathbf{h}_i/ds|\, ds = L\, ds$.
We take the sign $\sigma_i = \pm 1$ of $u_i$ to be positive for right-handed and negative for left-handed
helices, while the product $u_i h_i$ is always positive. We also require that helices have the same
length as the chain they represent, setting $\Delta s = 1$ for a peptide-unit move along the protein
chain. This requirement implies that $n_i$ as defined above coincides with the number of
residues in the helix.
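
As a quick numerical illustration of Eqs. (2) and (3), the following Python sketch builds a single helix in its own reference frame and checks that a unit step in $s$ corresponds to an arc of length $L$. The function names and the numerical values of the radius `a`, the pitch parameter `h` and the peptide-unit length `L` are illustrative assumptions, not values taken from the text.

```python
import math

def angular_step(a, h, L, sigma=1):
    """Eq. (3): u = sigma * L / (a * sqrt(1 + h^2)); sigma = +1 right-handed, -1 left-handed."""
    return sigma * L / (a * math.sqrt(1.0 + h * h))

def helix_point(t, a, u, h):
    """Helix of Eq. (2) in its own frame, with t = s - s_{i-1} and the origin at the helix start."""
    return (a * (math.cos(u * t) - 1.0), a * math.sin(u * t), a * u * h * t)

def arc_length(a, u, h, t0, t1, steps=20000):
    """Approximate the curve length between t0 and t1 with a fine polyline."""
    total, prev = 0.0, helix_point(t0, a, u, h)
    for k in range(1, steps + 1):
        cur = helix_point(t0 + (t1 - t0) * k / steps, a, u, h)
        total += math.dist(prev, cur)
        prev = cur
    return total

# Illustrative (assumed) parameters, in angstrom.
a, h, L = 2.3, 0.4, 3.8
u = angular_step(a, h, L)
print(arc_length(a, u, h, 0.0, 1.0))   # close to L
```

For a left-handed helix one would pass `sigma=-1` together with a negative `h`, so that the product $u_i h_i$ stays positive as required above.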

In order to write down a simple hamiltonian, we further simplify the model, resorting to 
the following variables: 

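$$ \begin{aligned}
N_h &: \ \text{the total number of helices} \\
n_i &= s_i - s_{i-1} \qquad (n_i \in [p_1, p_2]) \\
l_i &= \tfrac{1}{2} \left( s_i + s_{i-1} + 1 \right) \\
\mathbf{V}_i &= \mathbf{h}_i(s_i) - \mathbf{h}_i(s_{i-1}) \\
\mathbf{B}_i &= \tfrac{1}{2} \left( \mathbf{h}_i(s_i) + \mathbf{h}_i(s_{i-1}) \right)
\end{aligned} \qquad (4) $$

where $p_2 = N - (N_h - 1) p_1$ and $p_1 = 3$, since a helix cannot be defined with fewer than three
residues. Here $n_i$ is the length of the $i$-th helix expressed in residues, with $i \in [1, N_h]$; $l_i$ represents
the position along the sequence of the center of the $i$-th helix; $\mathbf{V}_i$ is the vector joining the
end-points of helix $\mathbf{h}_i$; $\mathbf{B}_i$ is the spatial position of the middle point of $\mathbf{V}_i$.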
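
The sequence-space part of these definitions can be sketched in a few lines of Python. The function name, the list layout of the junction points and the example junctions are illustrative assumptions; only the formulas for $n_i$, $l_i$ and the bounds $p_1$, $p_2$ come from the text.

```python
def helix_variables(junctions, p1=3):
    """Return (n_i, l_i) for each helix, given junctions [s_0 = 0, s_1, ..., s_Nh = N]."""
    N, Nh = junctions[-1], len(junctions) - 1
    p2 = N - (Nh - 1) * p1          # longest helix compatible with Nh helices
    variables = []
    for s_prev, s_next in zip(junctions, junctions[1:]):
        n = s_next - s_prev
        if not p1 <= n <= p2:
            raise ValueError(f"helix length {n} outside [{p1}, {p2}]")
        variables.append((n, (s_next + s_prev + 1) / 2))
    return variables

print(helix_variables([0, 25, 38, 57, 74]))
# [(25, 13.0), (13, 32.0), (19, 48.0), (17, 66.0)]
```

Note that $l_i$ is an integer when $n_i$ is odd and a half-integer when $n_i$ is even.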

Two other variables are necessary to specify the "shape" of a helix: a particularly useful 
choice is to introduce: 

Zi = — , (5) 

Ui 

$$ w_i = u_i + 2\pi\, \theta(-u_i) , \qquad (6) $$

where $u_i = L \sigma_i (\kappa_i^2 + \tau_i^2)^{1/2}$ ($\kappa_i$, $\tau_i$ are the constant curvature and torsion of the $i$-th helix) and
$\theta$ is the Heaviside function. The definition of $w_i$, with $w_i \in [0, 2\pi]$, allows us to remove the
discontinuity between right- and left-handed helices at $u_i = \pm\pi$, which is model-induced but
inevitable in a description of the chain in terms of helices. The sequence enters the model
through the variables $q_k$ ($k = 1, \dots, N$) and $p_\perp^2(l, w)$. The former are related to the nature
of each residue $k$, and measure its coupling to the other residues, due to the fact that the
Miyazawa-Jernigan interaction matrix can be written as:

$$ M_{\rho\sigma} = \mu_0 + \mu_1 (q_\rho + q_\sigma) + \mu_2 q_\rho q_\sigma \qquad (\rho, \sigma = 1, \dots, 20) . \qquad (7) $$

Since we deal with entire helices at a time, and not with single residues, we introduce
the average $\bar{q}$ of a helix, centered at $l_i = l$, as

$$ \bar{q}(l) = \begin{cases} \dfrac{1}{2m+1} \displaystyle\sum_{j=-m}^{m} q_{l+j} , & \text{if } l \text{ is an integer,} \\[2ex] \dfrac{1}{2(2m+1)} \displaystyle\sum_{j=-m}^{m} \left( q_{l-\frac{1}{2}+j} + q_{l+\frac{1}{2}+j} \right) , & \text{if } l \text{ is a half-integer} \end{cases} \qquad (8) $$

(integer or half-integer values of $l$ are the only ones allowed for the central points of the
helices; the variable $m$ is an arbitrary number, comparable with the mean length of the
helices).
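
A minimal Python sketch of this window average is given below. The function name is hypothetical, and the treatment of windows that would extend beyond the chain ends (here simply rejected) is an assumption, since the text does not specify a boundary rule.

```python
def q_bar(q, l, m):
    """Window average of q around l; q[0] holds residue 1, l is an integer or half-integer."""
    def window(center):
        lo, hi = center - m, center + m          # 1-based residue indices
        if lo < 1 or hi > len(q):
            raise ValueError("window exceeds the chain (boundary rule is an assumption)")
        return sum(q[lo - 1:hi]) / (2 * m + 1)

    two_l = round(2 * l)
    if two_l % 2 == 0:                           # integer l: one window centered at l
        return window(two_l // 2)
    k = (two_l - 1) // 2                         # half-integer l = k + 1/2
    return 0.5 * (window(k) + window(k + 1))     # mean of the two neighbouring windows

q = list(range(1, 11))                           # toy values, not real residue charges
print(q_bar(q, 3, 1), q_bar(q, 3.5, 1))          # 3.0 3.5
```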

The other variables are defined by:

$$ p_\perp^2(l, w, n) = \frac{1}{\left( \sum_{j=-n}^{n} Q_{l+j} \right)^2} \sum_{j,k=-n}^{n} Q_{l+j} Q_{l+k} \cos((j - k) w) , \qquad (9) $$

where $p_\perp(l, w, n)$ is the projection on the plane perpendicular to the helix axis of the
"hydrophobic dipole moment", calculated at a point on the axis and normalized with respect
to the total hydrophobic charge $\sum_{j=-n}^{n} Q_{l+j}$, where $Q_\rho = \mu_0/2 + \mu_1 q_\rho + (\mu_2/2) q_\rho^2$.

The quantity $p_\perp^2$ reveals the prevalence of non-polar residues on one side of the helix,
characterized by the periodicity $w$.
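
The normalized hydrophobic moment can be evaluated directly from its double-sum definition, as in the following Python sketch. The function names and the toy charge patterns are assumptions made for illustration; real values of $\mu_0$, $\mu_1$, $\mu_2$ and of the $q_\rho$ are not reproduced here.

```python
import math

def hydrophobic_charge(q, mu0, mu1, mu2):
    """Q = mu0/2 + mu1*q + (mu2/2)*q^2, i.e. half the diagonal element of Eq. (7)."""
    return mu0 / 2 + mu1 * q + (mu2 / 2) * q * q

def p_perp_sq(Q, l, w, n):
    """Normalized squared hydrophobic moment for integer l; Q[0] holds residue 1."""
    if l - n < 1 or l + n > len(Q):
        raise ValueError("window exceeds the chain")
    window = [Q[l + j - 1] for j in range(-n, n + 1)]
    total = sum(window)
    double_sum = sum(Qj * Qk * math.cos((j - k) * w)
                     for j, Qj in enumerate(window)
                     for k, Qk in enumerate(window))
    return double_sum / total ** 2

uniform = [1.0] * 7                  # no periodicity at all
alternating = [1.0, 0.0] * 4         # toy pattern with period two
print(p_perp_sq(uniform, 4, 2 * math.pi / 7, 3))   # essentially zero
print(p_perp_sq(alternating, 4, math.pi, 3))       # essentially one
```

With this normalization the quantity lies between 0 and 1 for non-negative charges, reaching 1 when all the charge in the window sits on one side of the helix with periodicity $w$.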

The following constraints hold among the variables previously defined: 

1. the sum of the residues of all the helices must be equal to the total length of the chain:

$$ \sum_{i=1}^{N_h} n_i - N = 0 ; \qquad (10) $$



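2. the length of $\mathbf{V}_i$ is related to the length and shape of the helix:

$$ V_i^2 = |\mathbf{h}_i(s_i) - \mathbf{h}_i(s_{i-1})|^2 = n_i^2 L^2 \left[ 1 - \frac{1}{1 + h_i^2} \left( 1 - \frac{\sin^2 \theta_i}{\theta_i^2} \right) \right] , \qquad (11) $$

where $\theta_i = n_i u_i / 2$;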



3. the end of one helix must coincide with the beginning of the following one, both in
sequence and in space:

$$ l_i - l_{i-1} - \frac{n_i + n_{i-1}}{2} = 0 . \qquad (13) $$

In these equations, $i$ ranges from 1 to $N_h$, and, to be consistent with the definitions of $l_i$, we
set $l_0 = 1/2$, $n_0 = 0$.
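
The relation between the end-to-end distance of a helix and its length and shape can be checked numerically. The Python sketch below compares the chord computed from the coordinates of Eq. (2) with a closed form, $n^2 L^2 [1 - (1 - \sin^2\theta/\theta^2)/(1 + h^2)]$ with $\theta = n u / 2$, which follows from Eqs. (2) and (3). The function names and parameter values are illustrative assumptions.

```python
import math

def chord_sq_direct(n, a, u, h):
    """|h_i(s_i) - h_i(s_{i-1})|^2 computed from the coordinates of Eq. (2)."""
    x = a * (math.cos(u * n) - 1.0)
    y = a * math.sin(u * n)
    z = a * u * h * n
    return x * x + y * y + z * z

def chord_sq_closed(n, L, u, h):
    """n^2 L^2 [1 - (1 - sin^2(theta)/theta^2) / (1 + h^2)], with theta = n u / 2."""
    theta = n * u / 2.0
    return (n * L) ** 2 * (1.0 - (1.0 - (math.sin(theta) / theta) ** 2) / (1.0 + h * h))

a, h, L = 2.3, 0.4, 3.8                      # illustrative values (assumed)
u = L / (a * math.sqrt(1.0 + h * h))         # Eq. (3), right-handed helix
for n in (3, 10, 25):
    print(n, chord_sq_direct(n, a, u, h), chord_sq_closed(n, L, u, h))
```

The two columns agree to rounding error, and both stay below $n^2 L^2$, the value reached only by a fully stretched chain.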

With the above defined variables we write a hamiltonian of the form: 

$$ H = H_{nn} + \sum_{i=1}^{N_h} \left( H_i^0 + H_i^1 \right) + \sum_{i<j=2}^{N_h} H_{ij} , \qquad (14) $$

where we have defined:

$$ H_{nn} = \gamma_2 (N_h - 1) , $$

{Ui - 1)70 Ci {{Wi - C2f - C3) +C4 + C5 - Cq + CY{Wi - Cg)^) 




Here the $\gamma_k$ are dimensional parameters weighting the various contributions, while the $c_k$ are known
dimensionless constants and $\Delta B_{ij} = |\mathbf{B}_i - \mathbf{B}_j|$.

Constraints will be implemented explicitly, by direct substitution of the variables in
the above hamiltonian, which will eventually be written as a function of the independent
variables.

A detailed discussion of the various terms appearing in Eq. (14) has been given elsewhere;
here we just recall that $H_i^0$ recovers in an effective way the experimental
Ramachandran plot, thus dictating which kinds of helices are more likely to be formed. $H_i^1$,
on the other hand, is sequence dependent and favours the separation of polar and non-polar
residues on the helices: $P(l_i, w_i) = \mathcal{F}(p_\perp^2(l_i, w_i, n))$ is some simple function of $p_\perp^2(l_i, w_i, n)$.

$H_{nn}$ represents an extremely simplified way to take next-neighbour interactions into
account: a constant, positive energy is involved in helix breaking, independently of the
orientation of the helices. $H_{ij}$ has the simple form of a square well with an infinite barrier on one side,
representing hard-core repulsion between helices. The interaction, in the range
$\Delta B_{ij} \in [\rho_0, \rho_1]$, has the form of Eq. (7), calculated with the average "charges" of the helices.

For the sake of simplicity inter-helical hydrogen bonds are not distinguished from
hydrophobic interactions (hence we disregard their dependence on orientation), and both are
described by $H_{ij}$.



We now consider Eq. (14) in the limit $\gamma_1 \ll \gamma_0$, without non-local interactions
($\gamma_3 = \gamma_4 = 0$) and at fixed number of helices $N_h$, and ask ourselves to what extent the correct native
secondary structure can be recovered by local terms only.

The former limit is equivalent to studying the ground state of $H_l = \sum_{i=1}^{N_h} H_i^1$ with only
two allowed values $(w_\alpha, w_\beta)$ for each $w_i$, corresponding respectively to the α and β configurations.
We shall look for the values of $(n_i, w_i)$, at fixed $N_h$, which best represent the native secondary
structure, in the cases of two synthetic sequences and of two natural proteins, identified by
PDB codes 2mhr (myohemerythrin) and 2asr (aspartate receptor, ligand binding domain).
These proteins are known to fold into the "four-helix bundle" conformation.

We assume that the function $P(l_i, w_i)$, appearing in the expression of $H_i^1$, has the form:

$$ P(l_i, w_i) = \begin{cases} p_\perp^2(l_i, w_i, 3) , & \text{if } l_i \text{ is an integer,} \\[1ex] \dfrac{1}{2} \left[ p_\perp^2(l_i - \tfrac{1}{2}, w_i, 3) + p_\perp^2(l_i + \tfrac{1}{2}, w_i, 3) \right] , & \text{if } l_i = k + \tfrac{1}{2} \text{, for integer } k. \end{cases} \qquad (15) $$

We have chosen $n = 3$ in expression (9) since this involves calculating the hydrophobic dipole
on a helix of seven residues, a reasonable length both for α-helices and for β-strands.



III. ANALYSIS OF THE FOUR PROTEINS

First of all we plot $P(l, w_\alpha)$, $P(l, w_\beta)$ for all the proteins: Figs. 1-4 reveal that
indeed a clear dominance of $P(l, w_\alpha)$ seems to be a sufficient condition for α-helices, though
not a necessary one.
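
Profiles of this kind can be generated along the following lines. This Python sketch is only illustrative: the function names are hypothetical, the value $w_\alpha = 2\pi/3.6$ assumes an ideal α-helix with 3.6 residues per turn (the text does not quote the value used), $w_\beta = \pi$ corresponds to two residues per turn, and the charge pattern is a toy example rather than a real sequence.

```python
import cmath
import math

W_ALPHA = 2 * math.pi / 3.6   # assumed: ideal alpha-helix, 3.6 residues per turn
W_BETA = math.pi              # ideal beta-strand, two residues per turn

def p_perp_sq(Q, l, w, n=3):
    """|sum_j Q_{l+j} e^{i j w}|^2 / (sum_j Q_{l+j})^2, equal to the double cosine sum."""
    offsets = range(-n, n + 1)
    window = [Q[l + j - 1] for j in offsets]          # Q[0] is residue 1
    moment = sum(Qj * cmath.exp(1j * j * w) for j, Qj in zip(offsets, window))
    return abs(moment) ** 2 / sum(window) ** 2

def P(Q, l, w):
    """p_perp^2 at integer l; mean of the two neighbouring values at half-integer l."""
    two_l = round(2 * l)
    if two_l % 2 == 0:
        return p_perp_sq(Q, two_l // 2, w)
    k = (two_l - 1) // 2
    return 0.5 * (p_perp_sq(Q, k, w) + p_perp_sq(Q, k + 1, w))

def profile(Q, w, n=3):
    """P(l, w) at every integer and half-integer l whose windows fit inside the chain."""
    centres = [two_l / 2 for two_l in range(2 * (n + 1), 2 * (len(Q) - n) + 1)]
    return [(l, P(Q, l, w)) for l in centres]

toy = [1.0, 0.0] * 6          # strictly alternating pattern, beta-like periodicity
for (l, p_alpha), (_, p_beta) in zip(profile(toy, W_ALPHA), profile(toy, W_BETA)):
    print(l, round(p_alpha, 3), round(p_beta, 3))
```

For the alternating toy pattern the β profile dominates everywhere, as one would expect for a sequence with period two.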

Then we study the ground state of the local hamiltonian $H_l$: we set $\gamma_1 = 1$ and exhaustively
search the configuration space with $N_h = 4$, recording the best ten configurations we
find. The choice of $N_h$ is suggested by our a priori knowledge of the native state of these
proteins, and by the reasonable assumption that the existence of short turns is related
to the three-dimensional structure rather than to sequence periodicity requirements, so that they
could not be efficiently recovered by the local hamiltonian.
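
The exhaustive search can be organized as in the Python sketch below, which enumerates every way of splitting $N$ residues into $N_h$ helices of at least $p_1 = 3$ residues, assigns each helix one of the two allowed periodicities, and keeps the lowest-energy configurations. The function names are hypothetical, and the energy is passed in as a callable; the toy energy used in the example is a placeholder and not the local hamiltonian of the model.

```python
import heapq
from itertools import product

def compositions(total, parts, minimum):
    """Yield all tuples of `parts` integers >= minimum that sum to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest

def best_configurations(N, energy, Nh=4, p1=3, kinds=("alpha", "beta"), keep=10):
    """Return the `keep` lowest-energy (energy, lengths, kinds) triples."""
    def candidates():
        for lengths in compositions(N, Nh, p1):
            for assignment in product(kinds, repeat=Nh):
                yield energy(lengths, assignment), lengths, assignment
    return heapq.nsmallest(keep, candidates())

def toy_energy(lengths, assignment):
    """Placeholder energy (assumption): rewards residues placed in alpha helices."""
    return -sum(n for n, kind in zip(lengths, assignment) if kind == "alpha")

for entry in best_configurations(14, toy_energy, keep=3):
    print(entry)
```

The number of configurations grows as the number of compositions times $2^{N_h}$, which stays within a few million for chains of the lengths considered here, so a brute-force scan is feasible.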

To test the quality of the configurations we find, we proceed as follows: first of all we
divide each protein into four parts, corresponding to the four "arms" in the native bundle
conformation, and look at those which are in a helical configuration (for 2asr, we consider
the short 3_10-helices together with α-helices).

Then we consider our configurations and compare each element in the bundle with the
corresponding native one, and count the residues that have been correctly predicted as
belonging to an α-helix. If $n_\alpha$ is their number, the quantities:

$$ C_{tot} = \frac{n_\alpha}{N} , \qquad C_{rel} = \frac{n_\alpha}{n_\alpha^{nat}} , \qquad (16) $$

will give the fraction of success in relation respectively to the total number of residues
and to the number $n_\alpha^{nat}$ of residues belonging to helices in the native state.
We obtain the following results: 



protein   energy     (n1, n2, n3, n4)    helix          C_tot   C_rel

seqB      -20.290    (25, 3, 27, 29)     (α, α, β, α)   0.42    0.55
          -20.170    (25, 13, 19, 17)    (α, α, α, α)   0.70    0.93

seqF      -19.203    (5, 39, 13, 17)     (α, α, α, α)   0.55    0.73
          -19.062    (11, 27, 19, 17)    (α, α, α, α)   0.69    0.91

2asr      -36.550    (3, 3, 129, 7)      (β, β, α, α)   0.24    0.27
          -35.830    (4, 3, 127, 8)      (β, α, α, α)   0.25    0.28

2mhr      -35.297    (5, 47, 49, 17)     (β, α, α, β)   0.24    0.34
          -35.192    (39, 13, 49, 17)    (α, α, α, β)   0.42    0.60



For each protein the first line refers to the ground state, while the second refers to the 
configuration with the highest correlation to the native state, among the ten recorded. The 
most native-like conformations for the four proteins appear at positions 3, 8, 9, 2 respectively, 
in the list of the best ten configurations. 

For both the synthetic sequences native α-helices correspond to residues (3-16; 22-35; 
41-54; 60-73); the secondary structure of 2mhr presents α-helices at positions (12-14, 19-37; 
41-64; 70-85; 93-109, 111-114); that of 2asr shows α-helices at positions (2-38; 49-72; 80-104; 
117-141), while residues 44-48, 77-79 are in the 3_10 conformation. 
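
One possible reading of the counting rule behind $C_{tot}$ and $C_{rel}$ is sketched below in Python: each predicted helix is compared only with the native arm carrying the same index, and only helices predicted as α contribute to $n_\alpha$. The function names and the data layout are assumptions; the residue ranges are the native α-helices quoted above for the synthetic sequences. Under this reading the sketch reproduces the tabulated values for the two configurations used in the example.

```python
def helix_ranges(lengths):
    """Convert helix lengths (n_1, ..., n_Nh) to inclusive 1-based residue ranges."""
    ranges, start = [], 1
    for n in lengths:
        ranges.append((start, start + n - 1))
        start += n
    return ranges

def overlap(a, b):
    """Number of residues shared by two inclusive ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def success_rates(lengths, kinds, native_arms):
    """Return (C_tot, C_rel) with per-arm matching of predicted and native helices."""
    N = sum(lengths)
    n_alpha = 0
    for rng, kind, arm in zip(helix_ranges(lengths), kinds, native_arms):
        if kind == "alpha":
            n_alpha += sum(overlap(rng, segment) for segment in arm)
    n_native = sum(end - start + 1 for arm in native_arms for start, end in arm)
    return n_alpha / N, n_alpha / n_native

synthetic_arms = [[(3, 16)], [(22, 35)], [(41, 54)], [(60, 73)]]
all_alpha = ("alpha",) * 4
c_tot, c_rel = success_rates((25, 13, 19, 17), all_alpha, synthetic_arms)
print(f"{c_tot:.2f} {c_rel:.2f}")   # 0.70 0.93
c_tot, c_rel = success_rates((5, 39, 13, 17), all_alpha, synthetic_arms)
print(f"{c_tot:.2f} {c_rel:.2f}")   # 0.55 0.73
```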



IV. COMMENTS AND CONCLUSIONS 



In this letter we addressed three questions: how good is the hydrophobic dipole moment 
in describing the relationship between sequence and secondary structure? What is the role 
of local terms in the hamiltonian? Is it possible to predict the native secondary structure 
on the grounds of the hydrophobic dipole alone? 

The results we obtain show that the choice of describing the sequence periodicity by 
means of $p_\perp^2(l, w, n)$ and $P(l, w)$ (Eqs. (9), (15)) is substantially correct, both at a descriptive 
and at a more quantitative level. 

Indeed, a qualitative correlation is evident between regions where $P(l, w_\alpha)$ dominates and 
the position of α-helices, in all the proteins considered. It is however not straightforward 
to describe this correlation quantitatively, since it is not easy to express unambiguously 
in mathematical language what one should recognize as "dominant". For this reason, we 
cannot exclude that better definitions than Eqs. (9), (15) may be found to characterize local 
periodicities in the sequence, even if we consider our choice to be a reliable one. 

Moreover, we have introduced an objective way to assess how similar the ground state is 
to the native one, and indeed the minimal energy configurations we find suggest that our 
variables and hamiltonian describe the system reasonably well. 

It can indeed be noticed that, despite the strong simplifications introduced in considering 
only α and β helices and in taking $N_h = 4$ (which forbids a simultaneous description of both 
the helices and the turns), we obtain good results for the two synthetic sequences: among 
the low energy states a configuration is found which shows a high degree of correlation to 
the native secondary structure, and the fact that this configuration is not the ground state 
can be considered a minor problem, at this level of simplification. 

The results for 2mhr and 2asr, on the other hand, leave us with several open questions
about the relative importance of local and nonlocal terms in the hamiltonian. The
hydrophobic moment diagrams (Figs. 3, 4) are more complex than those for the synthetic proteins,
which could signal a lesser importance of the local terms with respect to the nonlocal
ones. Indeed, it is commonly believed that the secondary structure results from the need to 
maximize compactness of the protein and protection of the non-polar residues from water. 
According to these ideas the periodicity of the sequence could be an outcome of evolution, 
useful to remove a possible source of frustration and prevent misfolding, while increasing 
the stability of the native state; yet proteins need not be optimized with respect to their 
periodicity. 

On the other hand, the poor results we obtain with these proteins could also be partially 
due to the approximations introduced, and we cannot exclude that better predictions could 
be obtained simply by resorting to a more complete expression of the local terms. A more definite 
answer to the above questions is left to future efforts. 

FIGURES 

FIG. 1. Plot of $P(l, w_\alpha)$ (continuous line) and $P(l, w_\beta)$ (dotted line) for the sequence seqB. 
FIG. 2. Plot of $P(l, w_\alpha)$ (continuous line) and $P(l, w_\beta)$ (dotted line) for the sequence seqF. 
FIG. 3. Plot of $P(l, w_\alpha)$ (continuous line) and $P(l, w_\beta)$ (dotted line) for the protein 2asr. 
FIG. 4. Plot of $P(l, w_\alpha)$ (continuous line) and $P(l, w_\beta)$ (dotted line) for the protein 2mhr. 